1. Introduce data and background from assignment 1 and 2

In assignment 1, I choosed the data Series GSE109161, data from an experiment about Chimeric Antigen Receptor (CAR) ligation transcriptional changes. The original paper is listed in “References”(Salter et al. 2018 , https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6186424/).

The downloaded supplmentary data include 18794 rows of symbols and 24 samples. In Assignment 1 I filtered weakly expressed features as well as rows that don’t match any HGNC symbols and then performed nomrlization. In Asignment 2 I defined the sample columns and desiged models.

Assignment 1 and 2 reports are included as child documents at the end of this report.

To reload the filtered result from Assignment 1:

car_exp_filtered <- read.csv("./data/result.csv",header=TRUE,stringsAsFactors=FALSE)
dim(car_exp_filtered)
## [1] 12815    25

The 24 columns in the data are from 24 samples(3 replicates * 2 time points * 2 samples for CD28/CD3\(\zeta\) and control * 2 samples for 4-1BB/CD3\(\zeta\) and control). Both CD28/CD3\(\zeta\) and 4-1BB/CD3\(\zeta\) samples are CAR stimulated samples only with two different CAR types.

In order for this notebook to compile, I ’ll repeat a few codes in Assignment 2 to define sample and model design.

filtered_data_matrix <- as.matrix(car_exp_filtered[,2:25])
rownames(filtered_data_matrix) <- car_exp_filtered[,1]
samples <- data.frame(lapply(colnames(filtered_data_matrix)[1:24],
                             FUN=function(x){unlist(strsplit(x, split = "\\_"))[c(2,3,4)]}))

colnames(samples) <- colnames(filtered_data_matrix)[1:24]
rownames(samples) <- c("CAR_type", "time_point", "stim_ctrl")
samples <- t(samples)
samples[samples[,"CAR_type"] =="A42",1]="4-1BB/CD3z"
samples[samples[,"CAR_type"] =="A44",1]="4-1BB/CD3z"
samples[samples[,"CAR_type"] =="A43",1]="CD28/CD3z"
samples[samples[,"CAR_type"] =="A45",1]="CD28/CD3z"
samples <- as.data.frame(samples)
model_design_stim <- model.matrix(
  ~ samples$stim_ctrl)

2. Non-thresholded Gene set Enrichment Analysis

2.1 Perform GSEA

Assignment 2 prodcued upregulated and downregulated list by thresghold method, here use non-threshold method to produce ranked gene list.

library(edgeR)
d = DGEList(counts=filtered_data_matrix, group=samples$stim_ctrl)
d <- estimateDisp(d, model_design_stim)
fit <- glmQLFit(d, model_design_stim)
qlf.stim <- glmQLFTest(fit)
rankstim <- topTags(qlf.stim,sort.by = "logFC",
                    n = nrow(filtered_data_matrix))
rankstim <- data.frame(genes=row.names(rankstim),rankstim)

The above code used “model_design_stim”, which contrast the samples marked as stimulated, to samples marked as control.

To define the “rank” column, I based rank on the logFC value as differential statistics.

rankstim$rank <- rankstim$logFC
rankstim <- rankstim[order(-rankstim$rank),]
rankstim <- rankstim[,c(1,7)]
head(rankstim)
##       genes     rank
## IL13   IL13 9.497257
## IL3     IL3 8.500301
## IFNG   IFNG 8.190831
## CCL3   CCL3 8.090140
## IL5     IL5 7.968311
## XIRP1 XIRP1 7.855486
tail(rankstim)
##             genes      rank
## RAP1GAP2 RAP1GAP2 -3.533432
## GPA33       GPA33 -3.598175
## CACNA1I   CACNA1I -3.619538
## GFPT2       GFPT2 -3.633567
## AMIGO3     AMIGO3 -3.648348
## FCRL6       FCRL6 -3.813774
write.table(x=rankstim,file=("./data/carrankstim.rnk"),sep = "\t",
            row.names = FALSE,col.names = TRUE,quote = FALSE)

The rank file “carrankstim.rnk” is created as above, prepared as Preranked GSEA input.

At the top of this rank file are genes with greatest positive fold change for stimulated samples relative to control samples. At the bottom of this file are genes with greatest (absolute value) negative fold change for stimulated samples relative to control samples.

I used GSEA java GUI application (v4.0.3) running on my local machine to perform Non-thresholded Gene set Enrichment Analysis. Used Preranked GSEA procedure, use the above carrankstim.rnk as rank list, used bader lab current version with GO biological process: “Human_GOBP_AllPathways_no_GO_iea_March_01_2020_symbol.gmt” as gene set file. Parameters for running include: “Number of permutations, 1000”, “No_collaps to gene symbol”, “Max size for gene sets, 200”, “min size for gene sets, 15”.

The running of GSEA took approximately 5 minutes.

Listed below is the report summarize:

GSEA Report for Dataset carrankstim

Enrichment in phenotype: na
4077 / 4965 gene sets are upregulated in phenotype na_pos
2265 gene sets are significant at FDR < 25%
1296 gene sets are significantly enriched at nominal pvalue < 1%
1794 gene sets are significantly enriched at nominal pvalue < 5%

Enrichment in phenotype: na
888 / 4965 gene sets are upregulated in phenotype na_neg
61 gene sets are significantly enriched at FDR < 25%
50 gene sets are significantly enriched at nominal pvalue < 1%
111 gene sets are significantly enriched at nominal pvalue < 5%

Dataset details
The dataset has 12815 features (genes)
No probe set => gene symbol collapsing was requested, so all 12815 features were used

Gene set details
Gene set size filters (min=15, max=200) resulted in filtering out 13440 / 18405 gene sets
The remaining 4965 gene sets were used in the analysis

Global statistics and plots
Plot of p-values vs. NES
p-values vs. NES
Global ES histogram
global es histogran

The whole report was uploaded and can be seen from this link.

2.2 Interpret the result

Phenotype Phenotype is not available because as the assignment required, here I performed the Preranked GSEA, which does not specify phenotype in input. Only standard GSEA need a class file to specify phenotype for every column of samples. However, the rankfile was based on my model design which specified the phenotypes “stim” and “ctrl” for every samples, rank was calculated as fold change of stimulated samples vs control samples.

Main metrics of GSEA
GSEA is actually a statistical test on whether the gene sets are significantly overrepresented on the top or bottom of the rank list. Specifically it used Kolmogorov–Smirnov statistics.

Postive ES(NES) gene sets with positive ES and NES values are those correspond to (overrepresented in) genes at the top of the rank file, which are genes with postive fold change of stimulated samples vs control samples .

Negative ES(NES) gene sets with negative ES and NES values are those correspond to (overrepresented in) genes at the bottom of the rank file, which are genes with negative fold change of stimulated samples vs control samples .

NES normalized ES so accounts for differences in gene set size and in correlations between gene sets and the rank list. To normalize GSEA needs permutation, for preranked GSEA permutation can only use gene sets permutation because no phenotype available.

FDR is the false discovery rate, FDR is adjusted for gene set size and multiple hypotheses testing while the nominal p-value is not. So for choosing the significant gene sets we need both thresholds.

Global staistics
The p-values vs. NES plot shows there are plenty of gene sets with smaller than 1% nominal p-values, but for building enrichment map I decided to take more stringent theshold, only look into smaller than 0.005 p-values gene sets. The 25% FDR threshold was marked as yellow on the plot. We can see there are enough gene sets satisfy both FDR 25% and p-value 0.005 thresholds, but the gene sets with negative NES that satisfy both thresholds are much less, because their FDR values are higher on average.
The Global ES histogram showes there are less gene sets with negative ES than with postive ES.
Combining results from the two plots we can predict there will much less gene sets with nagtive NES satify thresholds.

Representative gene sets
Positive NES
“4077 / 4965 gene sets are upregulated in phenotype na_pos”, are those gene sets with postive NES, on top of this list is “HALLMARK_E2F_TARGETS%MSIGDB_C2%HALLMARK_E2F_TARGETS”, which is a gene set contends 200 genes related to cell cycle control.

According to the Gene set enrichment paper (Subramanian et al. 2005, https://www.pnas.org/content/102/43/15545), “The leading-edge subset can be interpreted as the core of a gene set that accounts for the enrichment signal”. The above gene set has 149 genes listed as “CORE ENRICHMENT” genes, so should be intepreted as leading edge genes, which comprise majority of the 200 genes in the gene set. If we look at the enrichemnt plot of this gene set, then these genes correspond to genes in the ranked list prior to the peak score.
en plot for E2F

Negatice NES
“888 / 4965 gene sets are upregulated in phenotype na_neg” are those gene sets with negative NES, on top of this list is:“CILIUM MOVEMENT%GOBP%GO:000334”.This is a gene set with only 18 genes related to cilium movement.
Since the size of this gene set is small I manually checked all the genes in the gene set. All except one(OFD1) of them are in the rank list and mostly with negative rank, clearly the gene set is overrepresented in the bottom of the rank list even account for the small size.
14 genes out of 18 listed as “CORE ENRICHMENT” genes comprise the leading edge genes. In the enrichment plot of this gene set:
en plot cilium move

these genes correspond to genes that appear subsequent to the peak score.

Compare to the results from the thresholded analysis in Assignment #2
Even though in both analysis I used GO:BP as gene sets(the version through g:profiler and Bader lab, respectively):
There is no overlap in Top 10 positive NES group with the top 10 upregulated gene sets of assignment 2.
There is no overlap in Top 10 negative NES group with the top 10 downregulated gene sets of assignment 2.
Is this a straight forward comparison?
No, there are three reasons:
1, The threshold analysis in assignment 2 used only part of the genes(p-value < 0.05) in the data for gene set overrepresentation test, whereas the non-threshold analysis in this assignment used all the genes for overrepresentation test. So genes with greater p-values can influence the result.
2, The threshold analysis only test gene set overrepresented or not so Fisher’s test is enough, whereas the non-threshold analysis in this assignment test whether some gene set overrepresented in the top and some gene set overrepresented in bottom of the ranklist. The test is Kolmogorov–Smirnov test, and the rank list is fold change rank. The fold change rank is not included threshold analysis, which only consider fold change is postive or negative.
3, I used “model_design_comb” result for the threshold analysis in assignment 2, which combined stimulation vs control and two CAR type in the samples for model. Here in assignment3 I only used stimulation vs control as model.

Thereshold analysis and non-threshold analysis both have pros and cons, from the point of view of top 10 list of result, threshold analysis result found out pathways clearly related to T cell receptor activities while non-threshold analysis still have those patheways picked out (I will show in enrichment map part) but not in top 10.

So this is an example that threshold analysis may quick at pick the top pathways, but non-threshold analysis may good at paint a more complete picture.

Reference:(Reimand et al. 2019).

3 Enrichment map in Cytoscape

3.1 Initial enrichment map

I created the enrichment map using GSEA and Cytoscape (Version: 3.7.2) application running on my local computer.

The GSEA result named “my_analysis.GseaPreranked.1585669102086” is checked in github data directory.

I loaded this result as auotmaticaly named dataset “dataset 1” (this name can’t be changed and carried to cytoscape legend). Enrichment map parameters were: P-value Cutoff: 0.005, FDR Q-value Cutoff: 0.1, Similarity cutoff: overlap coefficient. These parameters already can generate rich and interesting nodes and edges in map, so I don’t have to loose any threshold.

The cretated enrichment map before any manual adjustment:

The number of number of nodes and edges can be seen on Cytoscape network panel in control panel. there are 1163 nodes and 18198 edges.
From the map we can see predominently red nodes and very few blue nodes. The colors were illustrated by the legend produced by cytoscape:
legend

The colors are showing the NES values of the nodes, positive NES nodes painted red, negative NES nodes painted blue. As predicted in section 2, only few nodes with negative NES passed the 0.005 pe-value and 25% FDR thresholds, the main limiting factor is FDR. The following figures used Cytoscape app “legend Creator” to add legend.

From the map we can see many gene sets are significantly overrepresented in genes at the top of rank list, which are upregulated genes in stimulated samples compared to controls, but only a few gene sets are significantly overrepresented in genes at the bottom of rank list, which are downregulated genes in stimulated samples compared to controls. This shows the experiment captured mostly an expression profile more focused on upregulated genes.

The map contains big clustes of densly connected nodes, as well as sparsly connected nodes or singlgtons. It integrated so rich informations, a closer look at part of the network before manual work:

But I have to focus on themes related to the experiment and original paper, they are inside the network as the overlaped image above but nodes not intereted should cut off. So I decided to choose a subset of the network.

Since the experiment is about T cells, I used keyword “T cells” as search bar(top right corner) input, after searching selected nodes and edges, click “Select” drop menu to “Hide Unselected nodes and edges”.

On the resulting network I then add annotations.

3.2 Annotate enrichment map

First I used AutoAnnotate. After installed the app, I created new Annotation set using the following parametrs: “Annotate entire network”(it will not exceed 10 annotations anyway), checked “layout network to prevent cluster overlap”, Advanced options choosed “Use clusterMaker App”, Cluster algorithm choosed “MCL cluster”, edge weight column choosed “similarity_coeffcient”, label Column used “GS_DESCR”, Label Algorithm USed “WordCloud:Adjacent Word(default)”, “Max word per label:3”, “Adjacent word bonus:8”.

This default parameters are pretuned for enrichment map so keep the defalut and trun out to works fine.

The resulting auto annoation before manual adjustment(Only moved the clusters closer):

The big cluster have many overlaps and the cluster name is not accurate, so I did some manual annotation adjustment. The result is the following figure:

The nodes inside the big cluster “T cell regulation/activation” are listed below:

REGULATION OF ACTIVATED T CELL PROLIFERATION%GOBP%GO:0046006
REGULATION OF T CELL PROLIFERATION%GOBP%GO:0042129
REGULATION OF T CELL DIFFERENTIATION%GOBP%GO:0045580
REGULATION OF CD4-POSITIVE, ALPHA-BETA T CELL ACTIVATION%GOBP%GO:2000514
REGULATION OF CD4-POSITIVE, ALPHA-BETA T CELL DIFFERENTIATION%GOBP%GO:0043370
REGULATION OF ALPHA-BETA T CELL DIFFERENTIATION%GOBP%GO:0046637
REGULATION OF REGULATORY T CELL DIFFERENTIATION%GOBP%GO:0045589
REGULATION OF ALPHA-BETA T CELL ACTIVATION%GOBP%GO:0046634
REGULATION OF T CELL CYTOKINE PRODUCTION%GOBP%GO:0002724
REGULATION OF T-HELPER CELL DIFFERENTIATION%GOBP%GO:0045622
POSITIVE REGULATION OF ALPHA-BETA T CELL ACTIVATION%GOBP%GO:0046635
POSITIVE REGULATION OF ALPHA-BETA T CELL DIFFERENTIATION%GOBP%GO:0046638
POSITIVE REGULATION OF CD4-POSITIVE, ALPHA-BETA T CELL ACTIVATION%GOBP%GO:2000516
POSITIVE REGULATION OF T CELL ACTIVATION%GOBP%GO:0050870
POSITIVE REGULATION OF T CELL DIFFERENTIATION%GOBP%GO:0045582
NEGATIVE REGULATION OF T CELL PROLIFERATION%GOBP%GO:0042130
NEGATIVE REGULATION OF T CELL ACTIVATION%GOBP%GO:0050868
NEGATIVE REGULATION OF T CELL DIFFERENTIATION%GOBP%GO:0045581

Many of the nodes connected with thick edges, indicating these gene sets overlap heavyly, and the gene sets are very similar to each other (I build the enrichment map using “overlap” as parameter for similarity coefficient). Since these pathways are very important and attracted great attention, researchers found many pathways from similar batch of genes.

The themes in this network are closely related to the effect to this experiment of Chimeric Antigen Receptor in genral, that is they modify the T cells receptor specificity. Some of the results in assgnment 2 are very similar for examle “regulation of T cell activation GO:0050863”. So the network generally agree with the descripitions in the original paper(Salter et al. 2018 ), but the paper is not a complete enrichment analysis per se, it’s focused on specific topics with clinic implications.

Reference: (Merico et al. 2010).

4 Post analysis, using transcription factor

I created new network using the a subset of the original network, by deleting nodes and edges. Deleted all nodes and edges that doesn’t contain “T cell” by new column filter. Procedure is: click “Select” panel in control panel, click “+” to add new filter, use “Node: Formatted_name” as column selector, select “doesn’t contain” from drop menu, and enter the keyword, and hit “Apply”, then in main Edit dropmenu, select “Delete selected nodes and edges”.

I downloaded “Human_TranscriptionFactors_MSigdb_March_01_2020_symbol.gmt” from Bader lab website to local machine.

To create sognature gene set, in the EnrichmentMap panel, click “add new Signature Gene Sets”, used “Mann-Whitney (two-sided)” test with cutoff 0.05,

Inspired by previous knowledge and from the paper prediction, I specifically selected transcription factor NFAT and FOXO4, which are known to be related to T cell and the experiment, deselect all other TFs. If don’t choose subset of the data in advance then computation will be slow and extreamly hard to use. The resulting network with minor manual adjustment is as follows:

The figures shows NFAT closely related to T cell activation and regulation. In the paper (Rao, Luo, and Hogan 1997) : “Proteins belonging to the NFAT (nuclear factor of activated T cells) family of transcription factors play a central role in inducible gene transcription during the immune response (1, 2, 3, 4). Despite their name, NFAT proteins are expressed not only in T cells, but also in other classes of immune-system cells”. Here the genesets regulated by NFAT are all correspond to genes upregulated in CAR stimulated samples (showing red).

The original paper of this experiment (Salter et al. 2018) described Krüppel-like factor 2 (KLF2) and IL-7 receptor (IL7R), and “KLF2 and IL7R are associated with memory T cell formation and are targets of the FOXO family of transcription factors”. This is consistent with the above figure that connected multiple gene sets with FOXO4 transcriptional factor. IL-7 receptor gene is upregulated, KLF2 is downregulated showing in the rank list file (These are two genes not gene sets).

5 A specific pathway: T Cell Receptor Signaling Pathway

At the focus of this experient and the paper is the pathway T cell receptor (TCR) signaling pathway. CAR works by engineering to modify the T cell receptor to give T cells the new ability to target a specific protein, so activation of T cells upon ligand binding depends on single-chain variable immunoglobulin fragment (scFv) fused to intracellular signaling domains.

T cell receptor signaling pathway had been intensely targeted in previous research. In this enrichment analysis, the Bader lab data contains multiple gene sets related to TCR, such as: “PID_TCR_CALCIUM_PATHWAY”, “CALCIUM SIGNALING IN THE CD4+ TCR PATHWAY”, “PID_CD8_TCR_DOWNSTREAM_PATHWAYY”, “T-CELL ANTIGEN RECEPTOR (TCR) PATHWAY DURING STAPHYLOCOCCUS AUREUS INFECTION”, “DOWNSTREAM TCR SIGNALING”, “T-CELL ANTIGEN RECEPTOR (TCR) SIGNALING PATHWAY”,“TCR SIGNALING%REACTOME”,“BIOCARTA_TCR_PATHWAY”.

From looking at theier names, we may guess these pathways have many overlaped genes, but If compare them directly by enrichment map, only 5 of them passed the pvalue 0.005 and FDR 0.1 thresholds, and only thick overlap exists between “CALCIUM SIGNALING IN THE CD4+ TCR PATHWAY” and “PID_TCR_CALCIUM_PATHWAY” ( I tried loose the restriction of threshold to pvalue 0.08 and FDR 0.2 to include “T-CELL ANTIGEN RECEPTOR (TCR) SIGNALING PATHWAY”, but the building of enrichment map report error, should be bug).

Here I focus on the patheway with highest position on GSEA report: “PID_TCR_CALCIUM_PATHWAY”. this pathway not showing in top 10 but close to top 10 (the postion in reslut is not excaltly the same for every run of GSEA, I saw one time in position 13, one time 18). The actual nominal p-value and FDR for this pathway are both “0.000”, so the reason it’s not on top postion can only be explained by its small size: 28.

GSEA page for the pathway (https://www.gsea-msigdb.org/gsea/msigdb/cards/PID_TCR_CALCIUM_PATHWAY) has its full name “Calcium signaling in the CD4+ TCR pathway”. This is a pathway published by PID, The Pathway Interaction Database (PID, http://pid.nci.nih.gov).

Below is a table extracted from GSEA report, included all gene sets in the leading edge:

GENE SYMBOL RANK IN GENE LIST RANK METRIC SCORE CORE ENRICHMENT
IL3 1 8.5 Yes
IFNG 2 8.191 Yes
CSF2 6 7.64 Yes
PTGS2 7 7.416 Yes
IL2 27 5.826 Yes
CD40LG 28 5.705 Yes
BATF3 31 5.605 Yes
FOSL1 37 5.368 Yes
FOS 51 4.863 Yes
FASLG 74 4.445 Yes
IL4 88 4.289 Yes
IL2RA 196 3.29 Yes
CREM 302 2.883 Yes

The table shows all the leading edge gene sets are high on the rank list, with great postive fold change in the CAR stimulated samples. In terms of enrichment analysis, this is the exact patheway we should be looking for. The patheway with most prominent change in expression profile after “treatment” of the experiment.

The enrichment map shows this pathway has many overlap with “CALCIUM SIGNALING IN THE CD4+ TCR PATHWAY”:

So “CALCIUM SIGNALING IN THE CD4+ TCR PATHWAY” closly follow in the report, it should be research of very similar mechanism of TCR signalling interactions.

The Pathway Interaction Database portal has retired, also can’t find “CALCIUM SIGNALING IN THE CD4+ TCR PATHWAY” on GSEA page but NDEx has page of the same name “Calcium signaling in the CD4 TCR pathway”:

http://www.ndexbio.org/#/network/b792e6db-795c-11e8-a4bf-0ac135e8bacf

NDEx shows the owner of the patheway is PID, so actually they are not two pathways but the two versions of the same patheway.

NDEx page has picture of gene network of this pathway.

The Reactome identifier for TCR signaling patheway is R-HSA-202403, here is the page on Reactome: https://reactome.org/content/detail/R-HSA-202403. And the pathway diagram is:

6 compare to original paper and other publications

The original paper (Salter et al. 2018) has many desciption and discussion of T cell receptor signaling pathway. In fact, although the paper attributes the difference of effect of the two CAR type, CD28/CD3\(\zeta\) and 4-1BB/CD3\(\zeta\) to the difference in activation of divergent signaling pathways through the distinct CAR costimulatory molecule domains, but that is not the case in T cell receptor signaling. The paper decribed experiments used LC-MS/MS to analyze CAR stimulation–induced signaling events in primary human CD8+ T cells, and found that both types of CAR stimulation generate almost identical signaling events. This is consistent with what I found in the enrichment analysis data, where both types CAR stimulated samples have postive fold change in genes significantly associated with the TCR signaling pathway.

The paper described results in the experiments that could explain the molecular mechanism or details of what we observed in enrichment analysis. The authors “performed LC-MS/MS analysis to interrogate the signaling pathways activated in CD28/CD3\(\zeta\) or 4-1BB/CD3\(\zeta\) CARs in an unbiased manner”.

In one experiment, the log2 fold change values of phosphorylation of CD3\(\zeta\) at Tyr83, Tyr111, and Tyr142, and phosphorylation of phospholipase C–\(\gamma\) 1 (PLC-\(\gamma\) 1) at Ser1248 and BCL10 at Ser138 were recorded and compared for stimulated and controls at 10 minutes and 45 minutes. The below figure is the result:

These are details of what exact sugnaling events are in the pathway, theses events then represented in the expression profile, and eventually showed their effects in my enrichment analysis.

Another paper (Fracchia, Pai, and Walsh 2013) explored more machanisms in T cell metabolism through calcium signaling, including the TCR signling. TCR stimulation was described as signaling cascades, resulting in cellular activation and proliferation. One such molecule that is activated is phospholipase Cγ1 (PLCγ1), that is the molecule tested in the (Salter et al. 2018) paper. The binding of antigen/MHC to the TCR complex results in the phosphorylation and activation of PLCγ1, and in turn induce the production of other molecules such as DAG and IP3. In the process cytosolic calcium accmulated, eventually dephosphorylates the cytoplasmic subunits (NFATc) of nuclear factor of activated T cells (NFAT) transcription complexes.

This explains the molecular details of why we see in section 4 the NFAT transcriptional factor NFAT is related to the T cells activation process.

The paper (Huse 2009) offered more details of the role of NFAT in TCR signaling. Also there is a summarize of expression change: “TCR stimulation leads to profound changes in gene expression. Many of these changes are mediated by the transcription factors activator protein 1 (AP1, a heterodimer of Fos and Jun), nuclear factor of activated T cells (NFAT) and nuclear factor-κB (NF-κB). These three factors act together to activate transcription of the interleukin-2 gene”. the figure below from the paper illustrated the TCR signaling pathway at the begining of a cascades of signaling, this explains why TCR played so important a role in influencing the downstream networks, also explains why targeting T cell receptor is a key step in designing of Chimeric Antigen Receptor(CAR).

7 “Dark Matters”

Genes annotated are all the genes included in the gene sets used in the analysis. For a given enrichement analysis, only part of the gene sets actually used, the used gene sets are listed in the report, I then load this list. All the gene sets are the gene sets from the gmt file of the Bader lab data I used.

For reading the gmt file I used the GSA package which use simple C codes to read the file.

if (! requireNamespace("GSA", quietly=TRUE)) {
  install.packages("GSA")
}
library(GSA)
All_genesets<- GSA.read.gmt("./data/Human_GOBP_AllPathways_no_GO_iea_March_01_2020_symbol.gmt")
#Number of gene sets in the file
length(unlist(All_genesets$geneset.descriptions))
## [1] 18405
gene_names <- unlist(All_genesets$genesets, use.names=FALSE)
all_unique_genes <- unique(gene_names)
#number of unique genes in all gene sets
length(all_unique_genes)
## [1] 16517
#All significant genes (With p-values < 0.05) are generated in assignment 1 
down_genes <- read.csv("./data/car_downregulated_genes.txt",header = FALSE,stringsAsFactors=FALSE)
up_genes <- read.csv("./data/car_upregulated_genes.txt", header = FALSE,stringsAsFactors=FALSE)
sig_genes <- unlist(rbind(down_genes,up_genes))
length(sig_genes)
## [1] 874
#Load used gene sets from the enrichment analysis report
used_genesets <- read.csv("./data/my_analysis.GseaPreranked.1585669102086/gene_set_sizes.xls",
                          header = TRUE,stringsAsFactors=FALSE, sep = "\t" )

used_genesets <- used_genesets[used_genesets$STATUS!="Rejected!",1]
length(used_genesets)
## [1] 4965

From the enrichement analysis report:
“Gene set size filters (min=15, max=200) resulted in filtering out 13440 / 18405 gene sets
The remaining 4965 gene sets were used in the analysis”
This is consistent with the above codes shows(only 4965 in 18405 gene sets used).

The codes below shows genes used does not reduced to so dramatic(15371/16517). We only need annotate 874 significant genes with these genes in the pathway.

used_genes <- vector()
for (i in 1:length(All_genesets$geneset.names)){ 
if(unlist(All_genesets$geneset.names)[i] %in% unlist(used_genesets)){
used_genes <- append(used_genes, All_genesets$genesets[i])
}
}   
used_genes <- unique(unlist(used_genes))
length(used_genes)
## [1] 15391

Below is the heatmap of any significant genes that are not annotated to any of the pathways returned in the enrichment analysis. Since one gene is either annotated or not annotated, the heatmap is 1D heatmap.

ann_used <- rep(0, length(sig_genes))
for (i in 1:length(sig_genes)){
  if (sig_genes[i] %in% used_genes) {
    ann_used[i] <- 1
  }
}
ann_used <- matrix(ann_used)
image(ann_used)

Below is heatmap of any significant genes that are not annotated to any pathways in entire set of pathways used for the analysis.

annotated_genes <- rep(0, length(sig_genes))
for (i in 1:length(sig_genes)){
  if (sig_genes[i] %in% all_unique_genes) {
    annotated_genes[i] <- 1
  }
}
annotated_genes <-matrix(annotated_genes)
image(annotated_genes)

The two heatmap slightly different, meaning if we use the entire pathway to annotate, don’t use any cutoff for gene sets size, there will be more gene annotated, but not much.

The red lines are the gens annotated, the yellows are the genes not annotated, the “dark matters”.

8 Inclued Assignment 1 and 2 as child documents

Acknowledgement: Most of the r codes in this report are adapted from lecture 4 notes, authored by BCB420 Instructor Ruth Isserlin.

1. Choose and download the data set

I have chosen an experiment about chimeric antigen receptor ligation transcriptional changes. Its data can be downloaded with:

library(GEOquery)

gse <- getGEO("GSE109161",GSEMatrix=FALSE)

The experiment is interesting because chimeric antigen receptors(CAR), or artificial antigen receptors have been used to find treatment for B cell malignancies and its application was described as revolutionary, although clinic use still in early stage.

There are different kinds of CAR modified T cells that differ in efficacy and toxicity that thought to be resulted from difference in gene expression. In this experiment, two kinds of CAR stimulated T cells(CD28/CD3z and 4-1BB/CD3z) were compared in terms of gene expressions. Result in the experiment may help to improve clinic efficacy and reduce toxicity.

The test condition is CD28/CD3z and 4-1BB/CD3z CAR T cells prepared from healthy donors and stimulated by incubation with anti-CAR beads, control condition is those cells left unstimulated by incubation with control beads.

Seen from the description(https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109161), the experiment design has high quality, with 3 biological replicates for each different harvest time(6 and 24 hours), with control groups and two CARs groups data. Human T cells were prepared from healthy donors. Also the experimental platform is mature and new.

The experimental platform, date and other information can be found from the downloaded gse object using:

gse@header$title
gse@gpls$GPL16791@header$organism
gse@gpls$GPL16791@header$lsubmission date
gse@gpls$GPL16791@header$last_update_date
gse@gpls$GPL16791@header$title
gse@gpls$GPL16791@header$technology
current_gpl <- names(GPLList(gse))[1]
current_gpl_info <- Meta(getGEO(current_gpl))
length(current_gpl_info$series_id)
length(current_gpl_info$sample_id)

The experiment information summarized as:

Exiperiment title : Comparison of transcriptional changes after CD28/CD3z and 4-1BB/CD3z chimeric antigen receptor ligation
Platform title : Illumina HiSeq 2500 (Homo sapiens)
Technology : high-throughput sequencing
Submission data : Mar 14 2013
Last update data : Mar 27 2019
Organims : Homo sapiens
Number of GEO datasets that use this techology : 5943
Number of GEO samples that use this technology : 161048

2. Overview of data

The experiment data can be downloaded from supplementary material using:

library(GEOquery)
sfiles = getGEOSuppFiles('GSE109161')

A brief look at the data:

car_exp = read.csv(rownames(sfiles),header=TRUE, stringsAsFactors=FALSE)
car_exp[1:6,1:5]
##          X X1_A44_6hr_stim_July X2_A44_6hr_ctrl_July X3_A45_6hr_stim_July
## 1     A1BG                97.38               116.48                51.73
## 2 A1BG-AS1               105.61               158.03                65.84
## 3      A2M                 3.00                 8.64                 8.00
## 4  A2M-AS1                 4.00                 1.36                 1.00
## 5   A4GALT                 1.00                 0.00                 0.00
## 6     AAAS               444.00               462.00               416.00
##   X4_A45_6hr_ctrl_July
## 1               123.86
## 2               143.90
## 3                27.00
## 4                 2.00
## 5                 1.00
## 6               592.00

The data dimension:

dim(car_exp)
## [1] 18794    25

There are 18794 rows of expression data so it has good coverage.

The first column is “gene name”(sequence fragment name), other 24 columns are the raw counts for 24 samples(3 replicates * 2 time point * 2 samples for CD28/CD3z and control * 2 samples for 4-1BB/CD3z and control).

Obviously some genes mentioned in the paper have very different expression profile for test and control, for example IFNG( gene for interferon gamma):

car_exp[car_exp$X=="IFNG",c(2,3)]
##      X1_A44_6hr_stim_July X2_A44_6hr_ctrl_July
## 6880                 9773                   93

4-1BB/CD3z stimulated gene transcription counts 9773 while control counts only 93, a difference of two magnitudes.

3. Map identifier

The data table has only one column, and it’s gene name:

The identifiers are following the HGNC convention, and many identifiers are indeed HGNC gene name. For example the IFNG mentioned above, and also recognizable name such as : GZMB, IFNG, IL2, TNF, IL6, KLF2, IL7R and FOXO.

So no further mapping needed.

But I checked the identifiers in online HGNC checker: https://www.genenames.org/tools/multi-symbol-checker/

The result was downloaded as csv file. Checking this file found the majority of the symbols are approved or previously approved, but 3577 symbols are “Unmatched”.

hgnc_checker <- read.csv("./data/hgnc-symbol-check.csv", header = FALSE)

dim(hgnc_checker[hgnc_checker[,2]=="Unmatched",])
## [1] 3577    6

According to assignment requirement: “All rows should have a unique HUGO symbols”, so I think we should delete those rows that don’t match HUGO symbols.

Deletion of those rows are reasonable since counts that not matching any gene expression doesn’t contribute to differential expression analysis.

Looking at the unmatched rows in “hgnc-symbol-check.csv”, I realized all unmatched symbols have “.” in it. For example symbols start with “RP”. I guess it denotes repeats, not really gene name. Lecture slide also mentioned this pattern.

So I decided to delete all rows with “.” in its symbol:

car_exp <- car_exp[-grep("\\.", car_exp[,1]),]
dim(car_exp)
## [1] 15219    25

4. Cleaning

The data had already summarized or collapsed for each gene, there is no replicate(multiple rows that map to the same symbol) :

summarized_gene_counts <- sort(table(car_exp[1]),decreasing =TRUE)
summarized_gene_counts[1:10]
## 
##     A1BG A1BG-AS1      A2M  A2M-AS1   A4GALT     AAAS     AACS    AAED1 
##        1        1        1        1        1        1        1        1 
##    AAGAB     AAK1 
##        1        1

There is no reason to believe that there are outliers to be removed, rows with high raw counts are housekeeping genes, like B2M, high expression are anticipated.

As stated in following section, some outliers in the drawing boxplot process are the interesting part in the data, we should not remove them.

To prepare for normalization, filter weakly expressed and noninformative (e.g., non-aligned) features, first use cpm function in edgeR package to convert raw counts to read per million:

library(edgeR)
cpms = cpm(car_exp[,2:25])
rownames(cpms) <- car_exp[,1]

Then remove features without at least 1 read per million in 3 replicate samples,

keep = rowSums(cpms >1) >=3
car_exp_filtered = car_exp[keep,]

dim(car_exp_filtered)
## [1] 12815    25

15219-12815=2404 rows of data filtered, the remaining data( final coverage 12815) still have good coverage.

5. Representative statistics visualization

To compute statistics to show the data characteristics, first prepare the data groups:

samples <- data.frame(lapply(colnames(car_exp)[2:25],
                             FUN=function(x){unlist(strsplit(x, split = "\\_"))[c(2,4)]}))

colnames(samples) <- colnames(car_exp)[2:25]
rownames(samples) <- c("CAR_type","stim_ctrl")
samples <- data.frame(t(samples))
head(samples)
##                       CAR_type stim_ctrl
## X1_A44_6hr_stim_July       A44      stim
## X2_A44_6hr_ctrl_July       A44      ctrl
## X3_A45_6hr_stim_July       A45      stim
## X4_A45_6hr_ctrl_July       A45      ctrl
## X5_A44_24hr_stim_July      A44      stim
## X6_A44_24hr_ctrl_July      A44      ctrl

There are two types of CARs: A42 and A44 denote 4-1BB/CD3z CARs, A43 and A45 denote CD28/CD3z CARs. “stim_ctrl” attribute denotes the data are from stimulated sample or control group.

Then plot the distribution of the data for each group:

data2plot <- log2(cpm(car_exp_filtered[,2:25]))
box<-boxplot(data2plot, outline=FALSE, xlab = "Samples", ylab = "log2 CPM",
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "CAR RNASeq Samples")

abline(h = median(apply(data2plot, 2, median)), col = "green", lwd
       = 0.6, lty = "dashed")

counts_density <- apply(log2(cpm(car_exp_filtered[,2:25])), 2,
                        density)
#calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
  xlim <- range(c(xlim, counts_density[[i]]$x));
  ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
     ylab="Smoothing density of log2-CPM", main="", cex.lab =
       0.85)
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]],
                                          col=cols[i], lty=ltys[i])
#create legend
legend("topleft",colnames(data2plot),
       col=cols, lty=ltys, cex=0.55,
       border ="blue", text.col = "green4",
       merge = TRUE, bg = "gray90") 

The above boxplot omitted outliers, the first 100 of 6901 outliers can be shown below:

length(box$out)
## [1] 6901
box$out[1:100]
##   [1]      -Inf      -Inf      -Inf 12.428958 -3.836686      -Inf      -Inf
##   [8] -3.836686 12.610026      -Inf -3.836686      -Inf      -Inf      -Inf
##  [15]      -Inf      -Inf      -Inf -3.836686 -3.836686 -3.836686 -3.836686
##  [22] -3.836686      -Inf      -Inf      -Inf -3.271089 -3.836686 -3.836686
##  [29] -3.836686      -Inf      -Inf      -Inf      -Inf -3.836686      -Inf
##  [36] 13.589054      -Inf -3.836686 -7.158614      -Inf -3.836686 -3.836686
##  [43]      -Inf -3.836686 -3.458174      -Inf      -Inf      -Inf -3.836686
##  [50]      -Inf 11.731310      -Inf -3.836686      -Inf -3.836686      -Inf
##  [57] 12.279895 12.108942 -3.836686      -Inf -3.836686 -3.836686 11.748864
##  [64]      -Inf -3.836686 -3.836686 -3.836686 -3.836686      -Inf -3.836686
##  [71] -3.836686      -Inf -3.836686      -Inf -3.836686      -Inf -3.836686
##  [78]      -Inf      -Inf -3.836686      -Inf      -Inf -3.836686 -3.836686
##  [85]      -Inf -3.836686      -Inf -3.836686      -Inf -3.836686      -Inf
##  [92]      -Inf 13.993000 12.069054 11.704441 12.216604 12.134314 -3.836686
##  [99]      -Inf -3.836686

Actually there is no “inf” value in the data, they are infinite only in the sense they are too extreme to draw in the boxplot range.

The outliers are actually what we interested in the data, like the IFNG gene expression data mentioned before create outliers:

log2(cpm(car_exp[car_exp$X=="IFNG",c(2,3)]))
##      X1_A44_6hr_stim_July X2_A44_6hr_ctrl_July
## 6880             19.93157             19.93157

6. Normalization

The original downloaded data already the result of normalization, according to “RNA-seq data analysis” part of the paper that generate these data( Reference 1), “Raw count data were imported into R. edgeR was used to calculate the normalization factors to scale the raw library sizes, followed by a voom transformation from the limma Bioconductor package”.

But the supplemetary data set is only the raw counts data, so I am doing normalization again.

Package edgeR is chosen because it’s the most common package for Trimmed Mean of the M-values (TMM) normalization method, TMM is chosen because as a “Normalization by distribution” method, it choose a sample as a reference sample, then calculate fold changes and absolute expression levels relative to that sample, as described in Reference 2. It is considered as robust and commonly used method.

Use calcNormFactors function in edgeR package to perform normalization:

filtered_data_matrix <- as.matrix(car_exp_filtered[,2:25])
rownames(filtered_data_matrix) <- car_exp_filtered[,1]
d = DGEList(counts=filtered_data_matrix, group=samples$CAR_type)
d = calcNormFactors(d)
normalized_counts <- cpm(d)
normalized_density <- apply(log2(normalized_counts[,1:24]), 2,
                        density)

After normalization, plot the distribution density again:

xlim <- 0; ylim <- 0
for (i in 1:length(normalized_density)) {
  xlim <- range(c(xlim, normalized_density[[i]]$x));
  ylim <- range(c(ylim, normalized_density[[i]]$y))
}
cols <- rainbow(length(normalized_density))
ltys <- rep(1, length(normalized_density))

#plot the first density plot to initialize the plot
plot(normalized_density[[1]], xlim=xlim, ylim=ylim, type="n",
     ylab="Normalized density of log2-CPM", main="", cex.lab =
       0.85)
#plot each line
for (i in 1:length(normalized_density)) lines(normalized_density[[i]],
                                          col=cols[i], lty=ltys[i])
#create legend
legend("topleft",colnames(data2plot),
       col=cols, lty=ltys, cex=0.55,
       border ="blue", text.col = "green4",
       merge = TRUE, bg = "gray90") 

The plot is slightly different from the plot of distribution before normalization, but not much. The two plots have roughly same distribution range.

Also the boxplot:

par(fig=c(0,0.48,0,1), new=TRUE)
data2plot <- log2(car_exp_filtered[,2:25])
box<-boxplot(data2plot, outline=FALSE, xlab = "Samples", ylab = "log2 CPM",
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "Original")
abline(h = median(apply(data2plot, 2, median)), col = "green", lwd
       = 0.6, lty = "dashed")

par(fig=c(0.52,1,0,1), new=TRUE)
data2plot2 <- log2(cpm(normalized_counts[,1:24]))
box<-boxplot(data2plot2, outline=FALSE, xlab = "Samples", ylab = "log2 CPM",
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "Normalized")
abline(h = median(apply(data2plot2, 2, median)), col = "green", lwd
       = 0.6, lty = "dashed")

Save the data after normalization as final result, also save the filtered raw counts in case needed later:

result<-as.data.frame(normalized_counts )
result_raw <- as.data.frame(car_exp_filtered )
write.csv(result,"result.csv")
write.csv(result_raw,"result_raw.csv")

7. Compare test and control distribution

The data can be separated to test and control group:

test <- colnames(car_exp)[c(seq(2,24,by=2))]
test_sample <- car_exp[, c("X",test)]
control <- colnames(car_exp)[c(seq(1,25,by=2))]
control_sample <- car_exp[, control]

Try showing density plot for test group data:

And then control group distribution:

No clear difference can be seen from two plots.

The expression differences are specific for some subsets of genes, it can not be seen from overall disttribution plots.

8. Sample separation

Althoght no clear pattern in density distribution plot, sample saparation still can be seen by multidimenstional scaling plot:

plotMDS(d, labels=rownames(samples),
  col = c("darkgreen","blue")[factor(samples$stim_ctrl)], cex=0.5)

Conclusion: Clearly test and control groups cluster. The test group samples have smaller distance within group than between test and control group, showing test sample data are more similar within each other than with data of control group.

Of course this plot is only a visualization, not specific analysis of differential expression.

Reference

  1. Salter, A. I., Ivey, R. G., Kennedy, J. J., Voillet, V., Rajan, A., Alderman, E. J., … Riddell, S. R. (2018). Phosphoproteomic analysis of chimeric antigen receptor signaling reveals kinetic and quantitative differences that affect cell function. Science signaling, 11(544), eaat6753. doi:10.1126/scisignal.aat6753

  2. Evans, C., Hardin, J., & Stoebel, D. M. (2018). Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Briefings in bioinformatics, 19(5), 776–792. doi:10.1093/bib/bbx008

Notes for compile: The “data” folder and A2.bib file should be placed in the same directory of Rmd file in oreder for Rmd file to compile.

Acknowledgement: Most of the R codes in this report are adapted from lecture notes, authored by BCB420 Instructor Ruth Isserlin.

1. Introduce data and background from assignment 1

In assignment 1, I chose the data Series GSE109161, data from an experiment about chimeric antigen receptor ligation transcriptional changes. The original paper is (Salter et al. 2018)(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6186424/).

The downloaded supplementary data include 18794 rows of symbols and 24 samples. To regenerate data set from assignment 1, use the following codes:

library(GEOquery)
library(edgeR)

sfiles = getGEOSuppFiles('GSE109161')

car_exp = read.csv(rownames(sfiles),header=TRUE,stringsAsFactors=FALSE)
colnames(car_exp)[1] <- "hgnc_symbol"
cpms = cpm(car_exp[,2:25])
rownames(cpms) <- car_exp[,1]
keep = rowSums(cpms >1) >=3
car_exp_filtered = car_exp[keep,]
head(car_exp_filtered)
##   hgnc_symbol X1_A44_6hr_stim_July X2_A44_6hr_ctrl_July X3_A45_6hr_stim_July
## 1        A1BG                97.38               116.48                51.73
## 2    A1BG-AS1               105.61               158.03                65.84
## 3         A2M                 3.00                 8.64                 8.00
## 6        AAAS               444.00               462.00               416.00
## 7        AACS               398.00               449.00               356.00
## 8       AAED1                69.00                28.00                74.00
##   X4_A45_6hr_ctrl_July X5_A44_24hr_stim_July X6_A44_24hr_ctrl_July
## 1               123.86                 72.87                120.20
## 2               143.90                119.20                157.78
## 3                27.00                  0.00                  0.00
## 6               592.00                871.00                493.00
## 7               446.00                579.00                515.00
## 8                43.00                 48.00                 37.00
##   X7_A45_24hr_stim_July X8_A45_24hr_ctrl_July X9_A42_24hr_stim_July
## 1                 44.57                115.19                 16.18
## 2                 48.56                152.93                 36.97
## 3                  0.00                 14.00                  0.00
## 6                947.00                510.00                754.00
## 7                432.00                467.00                399.00
## 8                 44.00                 37.00                 70.00
##   X10_A42_24hr_ctrl_July X11_A43_24hr_stim_July X12_A43_24hr_ctrl_July
## 1                  89.05                  43.27                 124.22
## 2                 122.90                  33.29                 103.87
## 3                   0.00                   0.00                   0.00
## 6                 361.00                 820.00                 426.00
## 7                 334.00                 511.00                 338.00
## 8                  49.00                 145.00                  60.00
##   X1_A42_6hr_stim_August X2_A42_6hr_ctrl_August X3_A43_6hr_stim_August
## 1                  46.78                  50.01                  17.00
## 2                  96.77                  82.00                  56.76
## 3                   0.00                   1.00                   0.00
## 6                 241.00                 352.00                 149.00
## 7                 174.00                 253.00                 201.00
## 8                  23.00                  20.00                  32.00
##   X4_A43_6hr_ctrl_August X5_A42_24hr_stim_August X6_A42_24hr_ctrl_August
## 1                  37.73                   25.12                   46.76
## 2                  97.03                   60.21                  102.58
## 3                   0.00                    0.00                    0.00
## 6                 335.00                  452.00                  393.00
## 7                 248.00                  321.00                  309.00
## 8                  22.00                   57.00                   30.00
##   X7_A43_24hr_stim_August X8_A43_24hr_ctrl_August X1_A44_6hr_stim_November
## 1                   15.68                   30.60                   170.06
## 2                   50.28                   85.72                   242.45
## 3                    0.00                    1.00                     5.00
## 6                  430.00                  323.00                   699.00
## 7                  321.00                  264.00                   514.00
## 8                   92.00                   39.00                    63.00
##   X2_A44_6hr_ctrl_November X3_A45_6hr_stim_November X4_A45_6hr_ctrl_November
## 1                   155.79                    91.00                   166.82
## 2                   240.22                   143.07                   295.88
## 3                    20.00                     4.00                    47.00
## 6                   700.00                   537.00                   661.00
## 7                   544.00                   432.00                   536.00
## 8                    55.00                    92.00                    74.00
filtered_data_matrix <- as.matrix(car_exp_filtered[,2:25])
rownames(filtered_data_matrix) <- car_exp_filtered[,1]

The main idea of this experiment is that different kinds(4-1BB/CD3z vs CD28/CD3z) of CAR(chimeric antigen receptors) modified T cells differentiate in gene expression, which would affect efficacy and toxicity when used for treatment for B cell malignancies.

2. Sample groups

The data samples are obtained according to the experiment idea and design. The 24 columns in the data are from 24 samples(3 replicates * 2 time points * 2 samples for CD28/CD3z and control * 2 samples for 4-1BB/CD3z and control).

Based on the experiment design, I group the samples as follows:

samples <- data.frame(lapply(colnames(filtered_data_matrix)[1:24],
                             FUN=function(x){unlist(strsplit(x, split = "\\_"))[c(2,3,4)]}))

colnames(samples) <- colnames(filtered_data_matrix)[1:24]
rownames(samples) <- c("CAR_type", "time_point", "stim_ctrl")
samples <- t(samples)
head(samples)
##                       CAR_type time_point stim_ctrl
## X1_A44_6hr_stim_July  "A44"    "6hr"      "stim"   
## X2_A44_6hr_ctrl_July  "A44"    "6hr"      "ctrl"   
## X3_A45_6hr_stim_July  "A45"    "6hr"      "stim"   
## X4_A45_6hr_ctrl_July  "A45"    "6hr"      "ctrl"   
## X5_A44_24hr_stim_July "A44"    "24hr"     "stim"   
## X6_A44_24hr_ctrl_July "A44"    "24hr"     "ctrl"

The “stim_ctrl” column specify stimulated or control cells samples, the “CAR_type” column specify which CAR type used for stimulation. But actually A42 and A44 belong to same type “4-1BB/CD3z”, A43 and A45 belong to same type “CD28/CD3z”, I changed the sample group accordingly.

samples[samples[,"CAR_type"] =="A42",1]="4-1BB/CD3z"
samples[samples[,"CAR_type"] =="A44",1]="4-1BB/CD3z"
samples[samples[,"CAR_type"] =="A43",1]="CD28/CD3z"
samples[samples[,"CAR_type"] =="A45",1]="CD28/CD3z"
samples <- as.data.frame(samples)
head(samples)
##                         CAR_type time_point stim_ctrl
## X1_A44_6hr_stim_July  4-1BB/CD3z        6hr      stim
## X2_A44_6hr_ctrl_July  4-1BB/CD3z        6hr      ctrl
## X3_A45_6hr_stim_July   CD28/CD3z        6hr      stim
## X4_A45_6hr_ctrl_July   CD28/CD3z        6hr      ctrl
## X5_A44_24hr_stim_July 4-1BB/CD3z       24hr      stim
## X6_A44_24hr_ctrl_July 4-1BB/CD3z       24hr      ctrl

3. T-test and MA plot

3.1 T-test

If we are interested in whether genes expressed differentially for two CAR types, we can do a simple T-test.

cartype1 <- rownames(samples[samples$CAR_type=="4-1BB/CD3z",])
cartype2 <- rownames(samples[samples$CAR_type=="CD28/CD3z",])
type1data <- filtered_data_matrix[,cartype1]
type2data <- filtered_data_matrix[,cartype2]

t.test(type1data,type2data)
## 
##  Welch Two Sample t-test
## 
## data:  type1data and type2data
## t = 1.0605, df = 344345, p-value = 0.2889
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.92000  36.67201
## sample estimates:
## mean of x mean of y 
##  930.6456  917.7696

The t-test shows genes in general are not significantly differentially expressed. But after normalizing data, and do the test again:

d = DGEList(counts=filtered_data_matrix, group=samples$CAR_type)
d = calcNormFactors(d)
normalized_count_data <- cpm(d)

type1data <- normalized_count_data[,cartype1]
type2data <- normalized_count_data[,cartype2]

t.test(type1data,type2data)
## 
##  Welch Two Sample t-test
## 
## data:  type1data and type2data
## t = -3.5451, df = 344048, p-value = 0.0003926
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.002632 -1.440428
## sample estimates:
## mean of x mean of y 
##  68.20176  71.42329

Now the result is significant for two CAR types comparison, this shows normalization indeed have effect on analysis result.

Across all genes expression significantly different in general, but we are more interested in some genes that are differentially expressed.

For example IFNG gene is mentioned in the paper, I tested specific for this gene.

gene_of_interest <- which(car_exp_filtered$hgnc_symbol ==
                            "IFNG")
IFNG_1 <- normalized_count_data[gene_of_interest, cartype1]
IFNG_2 <- normalized_count_data[gene_of_interest, cartype2]
t.test(IFNG_1,IFNG_2)
## 
##  Welch Two Sample t-test
## 
## data:  IFNG_1 and IFNG_2
## t = -2.0166, df = 13.237, p-value = 0.06449
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3361.4644   112.6217
## sample estimates:
## mean of x mean of y 
##  462.1439 2086.5653

The gene was mentioned in the paper “CD28/CD3z CAR stimulation initiated more marked early transcriptional changes with greater fold increases in the expression”, but from t-test perspective, it didn’t reach exactly the 0.05 threshold.

ST8SIA1 gene on other hand, is quite significant.

gene_of_interest <- which(car_exp_filtered$hgnc_symbol ==
                            "ST8SIA1")
ST8SIA1_1 <- normalized_count_data[gene_of_interest, cartype1]
ST8SIA1_2 <- normalized_count_data[gene_of_interest, cartype2]
t.test(ST8SIA1_1,ST8SIA1_2)
## 
##  Welch Two Sample t-test
## 
## data:  ST8SIA1_1 and ST8SIA1_2
## t = 4.2914, df = 13.444, p-value = 0.0008147
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.960476 5.908618
## sample estimates:
## mean of x mean of y 
##  5.089674  1.155127

The varied expression across all genes can be seen visually by dispersion plot in section 5.2 of this report. Below in section 3.2 I compare the differential expression for two samples.

3.2 MA plot

From the MA plot we can have a quick look at how some genes differentially expressed for different CAR type samples.

limma::plotMA(log2(filtered_data_matrix[,c(1,3)]), ylab="M - ratio log expression",
       main="MAplot two CARtype compare")

The above plot picked a sample from 4-1BB/CD3z group and a sample from CD28/CD3z group. The plot shows the mean log-expression of both samples are in normal range, and the M-ratio shows that quite many genes expressed differentially as not all points hugged in zero line. Some of the genes upregulated, some downrugulated.

The plot drawn from two samples is a quick peek at how many genes might involved in differential expression.

4. MDS plot

The above MAplot compare just two samples. The MDS plot can visualize sample data distances across whole groups.

plotMDS(d, labels=rownames(samples),
        col = c("darkgreen","blue")[factor(samples$stim_ctrl)], cex=0.5)

The above plot is based on stimulate-control group division, and groups are clearly cluster together in the plot, indicating clear difference between stimulate and control groups.

But the main idea of the experiment is: is there difference between different CAR types? So I plot according to CAR_type:

plotMDS(d, labels=rownames(samples),
        col = c("darkgreen","blue")[factor(samples$CAR_type)], cex=0.5)

The plot is more subtle, needs further analysis. In section 5 I intended to find out how many genes are significantly differentially expressed induced by different CAR type.

The time points difference also apparent in the MDS plot.

plotMDS(d, labels=rownames(samples),
        col = c("yellow","red")[factor(samples$time_point)], cex=0.5)

In this plot all control groups hug the top left corner, while all stimulated samples of 24 hours clustered in top right corner, all stimulated 6 hours samples in bottom right corner.

5. Calculate P-Values and dispersion plot, mean-variance plot

5.1 P-values

For this analysis I chose the 0.05 significant threshold as this is the common standard. For multiple hypotheses testing used Benjamni-Hochberg threshold, also very common, less stringent than Bonferroni correction.

In first analysis model design is based on stimulate-control compare.

minimalSet <- ExpressionSet(assayData=normalized_count_data)

model_design_stim <- model.matrix(
  ~ samples$stim_ctrl)

fit_stim <- lmFit(minimalSet, model_design_stim)
fit2_stim <- eBayes(fit_stim,trend=TRUE)
topfit_stim <- topTable(fit2_stim,
                       coef=ncol(model_design_stim),
                       adjust.method = "BH",
                       number = nrow(normalized_count_data))
output_hits_stim <- merge(normalized_count_data[,1],
                         topfit_stim,by.y=0,by.x=1,all.y=TRUE)
#sort by p-value
output_hits_stim <- output_hits_stim[order(output_hits_stim$P.Value),]
output_hits_stim[1:10,]
##                   x       logFC    AveExpr         t      P.Value    adj.P.Val
## 7465         NFKBIA  248.150793 171.300211  16.33966 1.506724e-14 2.163203e-10
## 10346 RP11-93B14.10   -3.454166   2.258039 -14.39810 2.423471e-13 1.692105e-09
## 13800       ZFP36L2 -478.084901 413.718673 -14.14924 3.535776e-13 1.692105e-09
## 12080          TBCC  -49.545661  56.678631 -13.49206 9.838500e-13 3.507575e-09
## 6226         LINGO3  -27.240516  18.587801 -13.35619 1.221556e-12 3.507575e-09
## 1947           CCM2 -137.736132 125.674035 -13.04607 2.014806e-12 4.821095e-09
## 2140         CDKN1B -138.027630 159.850293 -12.73131 3.379554e-12 6.931465e-09
## 10348  RP11-93B14.9  -22.796747  16.657749 -12.44058 5.496525e-12 9.355677e-09
## 8009         PBXIP1 -423.946304 292.432855 -12.40219 5.864811e-12 9.355677e-09
## 4132         FAM8A1  -78.787014  58.041709 -12.17958 8.567447e-12 1.051718e-08
##              B
## 7465  4.615113
## 10346 4.281305
## 13800 4.230520
## 12080 4.085879
## 6226  4.053930
## 1947  3.978174
## 2140  3.897043
## 10348 3.818056
## 8009  3.807322
## 4132  3.743632
length(which(output_hits_stim$P.Value < 0.05))
## [1] 10105
length(which(output_hits_stim$adj.P.Val < 0.05))
## [1] 9665

Thousands of rows P-values, not only passed 0.05 threshold, but also passed the multiple hypothesis standard Benjamni-Hochberg threshold. The result is clearly significant.

Then again we are interested in CAR type samples compare. I need model design based on combination of CAR_type and stim_ctrl.

model_design_comb <- model.matrix(
  ~ samples$stim_ctrl + samples$CAR_type)
fit_comb <- lmFit(minimalSet, model_design_comb)
fit2_comb <- eBayes(fit_comb,trend=TRUE)
topfit_comb <- topTable(fit2_comb,
                        coef=ncol(model_design_comb),
                        adjust.method = "BH",
                        number = nrow(normalized_count_data))
output_hits_comb <- merge(normalized_count_data[,1],
                          topfit_comb,by.y=0,by.x=1,all.y=TRUE)

output_hits_comb <- output_hits_comb[order(output_hits_comb$P.Value),]
length(which(output_hits_comb$P.Value < 0.05))
## [1] 851
length(which(output_hits_comb$adj.P.Val < 0.05))
## [1] 0

This time 851 genes passed 0.05 P-value threshold but none pass the Benjamni-Hochberg multiple hypothesis test standard. I tried (maybe more stringent) Benjamini-Yekutieli standard and the result was the same.

This means the expression difference of two CAR types sample are some what significant but failed to pass multiple hypothesis testing. My thinking is the BH or BY standard maybe too stringent to pass. The difference in expression can be seen clearly by the following heatmap in section 6.

If add time points factor, then the diffenrence is more signifcant.

model_design_time <- model.matrix(~ samples$stim_ctrl 
                                  + samples$CAR_type + samples$time_point)
fit_time <- lmFit(minimalSet, model_design_time)
fit2_time <- eBayes(fit_time,trend=TRUE)
topfit_time <- topTable(fit2_time,
                        coef=ncol(model_design_time),
                        adjust.method = "BH",
                        number = nrow(normalized_count_data))
output_hits_time <- merge(normalized_count_data[,1],
                          topfit_time,by.y=0,by.x=1,all.y=TRUE)

output_hits_time <- output_hits_time[order(output_hits_time$P.Value),]
length(which(output_hits_time$P.Value < 0.05))
## [1] 4713
length(which(output_hits_time$adj.P.Val < 0.05))
## [1] 2131

This time also passed the Benjamni-Hochberg standard. Meaning the CAR type difference is significant if compounding diffenrent time points. For example, the paper mentioned early stage difference of expression induced by CD28/CD3z.

5.2 Dispersion plot and mean-variance plot

Based on above model I can draw a gene dispersion plot.

d <- estimateDisp(d, model_design_comb)
plotBCV(d, col.tagwise = "black", col.common = "red")

The plot shows the gene expression is highly variable relative to the mean among samples. A lot of dots(genes) highly expressed and have high variance.

plotMeanVar(d, show.raw.vars = TRUE, show.tagwise.vars=TRUE,
            show.ave.raw.vars = TRUE,
            NBline=TRUE,
            show.binned.common.disp.vars = TRUE)

The plot shows highly expressed genes tend to have high variance.

6. Heatmap

Heatmap for all data has too high computational load when perform distance calculation for cluster, actually we only interested in genes with significant p-values. Here I build a heatmap out of rows in tophits from combination of CAR_type and stim_ctrl model .

library(ComplexHeatmap)
library(circlize)
heatmap_matrix <- normalized_count_data[,2:ncol(normalized_count_data)]
rownames(heatmap_matrix) <- rownames(normalized_count_data)
colnames(heatmap_matrix) <- colnames(normalized_count_data[,2:ncol(normalized_count_data)])

top_hits <- output_hits_comb$x[output_hits_comb$P.Value<0.05]
top_hits[1:10]
##  [1] "BEND4"    "ADAM22"   "ST8SIA1"  "SGK223"   "GALNT12"  "CLECL1"  
##  [7] "MPZL3"    "COL6A1"   "SLC37A2"  "ARHGAP10"
heatmap_matrix_tophits <- t(
  scale(t(heatmap_matrix[
    which(rownames(heatmap_matrix) %in% top_hits),])))

if(min(heatmap_matrix_tophits) == 0){
  heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
                           c( "white", "red"))
} else {
  heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0,
                             max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
    
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
                           cluster_rows = TRUE,
                           cluster_columns = TRUE,
                           show_row_dend = TRUE,
                           show_column_dend = TRUE,
                           col=heatmap_col,
                           show_column_names = TRUE,
                           show_row_names = FALSE,
                           show_heatmap_legend = TRUE,
                           row_title = "genes",
                           column_title = "samples"
) 
current_heatmap

The heatmap pattern is obvious to human brain, moreover, from the clustered columns we can see stimulated samples cluster together(all on the left half, columns with “stim” label), control samples cluster together(all on right side, columns with “ctrl” label).

Also rows cluster together, indicating some groups of genes upregulated(red block) in stimulated group but the same genes downregulated(blue block) in control group, some genes downregulated in stimulated group but the same genes upregulated in control group.

But the two CAR type samples devide is not so obvious, some A45 and A43 CARs( the 41BB/CD3z type) cluster together on left side, but some A43 and A45 clutser together with A44(CD28/CD3z type).

I need another heatmap that contrast the two CAR types samples, this heatmap will not cluster on colums but put same CAR type columns together.

heatmap_matrix <- cbind(normalized_count_data[,cartype1],normalized_count_data[,cartype2])

heatmap_matrix_tophits <- t(
  scale(t(heatmap_matrix[
    which(rownames(heatmap_matrix) %in% top_hits),])))

if(min(heatmap_matrix_tophits) == 0){
  heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
                           c( "white", "red"))
} else {
  heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0,
                             max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}

current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
                           cluster_rows = TRUE,
                           cluster_columns = FALSE,
                           show_row_dend = TRUE,
                           show_column_dend = TRUE,
                           col=heatmap_col,
                           show_column_names = TRUE,
                           show_row_names = FALSE,
                           show_heatmap_legend = TRUE,
                           row_title = "genes",
                           column_title = "Left: 4-1BB/CD3z samples; Right: CD28/CD3z  samples"
) 
current_heatmap

Patterns on this heatmap is more subtle. Although clearly columns contrast each other, indicating samples generally differ each other, with blue-lighter control samples interlased with red-darker simulated samples.

The 4-1BB/CD3z samples on the left generally blue-lighter, the right CD28/CD3z samples are generally red-darker, althought contrast is not too strong.

Meaning the CD28/CD3z samples induce relatively higher expression, this is consistant with the data summary of series GSE109161 that states “CD28/CD3z CAR stimulation initiated more marked early transcriptional changes with greater fold increases in the expression”.

Another way to group samples is according to time points. The below heatmap put all 12 control samples on the left, then the 12 stimulated samples devided in two groups: 6 hour and 24 hours.

ctrl_col <- rownames(samples[samples$stim_ctrl=="ctrl",])
ctrl_sample <- filtered_data_matrix[, ctrl_col]
type1_6hr <- rownames(samples[samples$time_point=="6hr" & samples$CAR_type =="4-1BB/CD3z" 
                              & samples$stim_ctrl == "stim",])
type1_6hr_sample <- filtered_data_matrix[, type1_6hr]
type1_24hr <- rownames(samples[samples$time_point=="24hr" & samples$CAR_type =="4-1BB/CD3z" 
                              & samples$stim_ctrl == "stim",])
type1_24hr_sample <- filtered_data_matrix[, type1_24hr]
type2_6hr <- rownames(samples[samples$time_point=="6hr" & samples$CAR_type =="CD28/CD3z" 
                              & samples$stim_ctrl == "stim",])
type2_6hr_sample <- filtered_data_matrix[, type2_6hr]
type2_24hr <- rownames(samples[samples$time_point=="24hr" & samples$CAR_type =="CD28/CD3z" 
                               & samples$stim_ctrl == "stim",])
type2_24hr_sample <- filtered_data_matrix[, type2_24hr]
heatmap_matrix <- cbind(ctrl_sample, type1_6hr_sample, type2_6hr_sample, type1_24hr_sample, type2_24hr_sample)

heatmap_matrix_tophits <- t(
  scale(t(heatmap_matrix[
    which(rownames(heatmap_matrix) %in% top_hits),])))
 
heatmap_matrix_tophits[is.na(heatmap_matrix_tophits)] <- 0 
if(min(heatmap_matrix_tophits) == 0){
  heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
                           c( "white", "red"))
} else {
  heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0,
                             max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}

current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
                           cluster_rows = TRUE,
                           cluster_columns = FALSE,
                           show_row_dend = TRUE,
                           show_column_dend = TRUE,
                           col=heatmap_col,
                           show_column_names = TRUE,
                           show_row_names = FALSE,
                           show_heatmap_legend = TRUE,
                           row_title = "genes",
                           column_title = "12 control samples, 6hr samples, 24hr samples"
) 
current_heatmap

This heatmap reveals another aspect: regardless CAR type, 6 hours samples show similarty with 6 hours groups, as are 24hours groups. But different genes expressed in 6 hours group than that of 24 hours.

7. Thresholded over-representation analysis

The heatmap visually indicating groups of genes upregulated or downregulated, this section is to find out which of the genes over-representated and involved in which of the pathways or biological processes.

7.1 Find over-representated genes

d = DGEList(counts=filtered_data_matrix, group=samples$CAR_type)
d <- estimateDisp(d, model_design_comb)
d <- calcNormFactors(d)
fit <- glmQLFit(d, model_design_comb)
qlf.cartype <- glmQLFTest(fit, coef='samples$CAR_typeCD28/CD3z')
qlf_output_hits <- topTags(qlf.cartype,sort.by = "PValue",
                           n = nrow(filtered_data_matrix))
length(which(qlf_output_hits$table$PValue < 0.05))
## [1] 874
length(which(qlf_output_hits$table$FDR < 0.05))
## [1] 0
topTags(qlf.cartype)
## Coefficient:  samples$CAR_typeCD28/CD3z 
##                  logFC     logCPM        F       PValue       FDR
## BEND4       -0.9537745  1.5664672 28.55908 1.657809e-05 0.1202228
## IGF2         4.1453906  0.1439115 27.37252 2.204576e-05 0.1202228
## ST8SIA1     -2.5005554  1.7646365 26.83825 2.512143e-05 0.1202228
## ARHGAP10    -0.8458965  4.6492491 24.32142 4.741849e-05 0.1529173
## VCAM1       -2.3245066  3.6607103 23.74380 5.513240e-05 0.1529173
## MAF          1.5175322  4.2579100 23.18494 6.390636e-05 0.1529173
## SLC37A2      1.1831581  0.6906480 20.74732 1.244726e-04 0.2552933
## SPRY4        2.3214545 -0.7261165 19.99192 1.542525e-04 0.2768254
## MIR4435-2HG  1.0458134  3.8236969 19.55622 1.748840e-04 0.2789788
## IL10         3.7555961  2.4874232 19.18699 1.947212e-04 0.2794881
length(which(qlf_output_hits$table$PValue < 0.05
             & qlf_output_hits$table$logFC > 0))
## [1] 414
length(which(qlf_output_hits$table$PValue < 0.05
             & qlf_output_hits$table$logFC < 0))
## [1] 460
qlf_output_hits_withgn <- merge(car_exp[,1],qlf_output_hits,
                                by.x=1, by.y = 0)
qlf_output_hits_withgn[,"rank"] <- -
  log(qlf_output_hits_withgn$PValue,base =10) *
  sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <-
  qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]
upregulated_genes <- qlf_output_hits_withgn$x[
  which(qlf_output_hits_withgn$PValue < 0.05
        & qlf_output_hits_withgn$logFC > 0)]

downregulated_genes <- qlf_output_hits_withgn$x[
  which(qlf_output_hits_withgn$PValue < 0.05
        & qlf_output_hits_withgn$logFC < 0)]

write.table(x=upregulated_genes,file=("./data/car_upregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

write.table(x=downregulated_genes,
            file=("./data/car_downregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

The above code shows number of upregulated and downregulated genes are on roughly the same scale. The result genes are separately stored in list files.

7.2 MA plot

Plot the above result in MA plot:

up <- qlf.cartype[which(qlf.cartype$table$PValue < 0.05 & qlf.cartype$table$logFC > 1),]
down <- qlf.cartype[which(qlf.cartype$table$PValue < 0.05
                              & qlf_output_hits$table$logFC < -1),]

plot(qlf.cartype$table$logCPM, qlf.cartype$table$logFC, cex = .5, main = "CAR type MA plot", xlab = "LogCPM",ylab = "LogFC")
points(up$table$logFC, cex = .5, col = "red")
points(down$table$logFC, cex = .5, col = "blue")

7.3 Over-representation analysis using g:Profiler

I used g:Profiler for this report. Choose this tool because its friendly interfaces and I have already familiar with it. It data sources includes the most sources we are interested and the data are regularly updated and curated.

The process of inquiry:

*In g:Profiler web service, upload the gene list(upregulated and downregulated genes separately).

*In data sources option select the three sources GO:BP, Reactome, Wikipathway. Choose these annotations because I am seeking the biological processes or pathways the selected genes are involved, terms like cellular component and molecular function not relevant.

*In “Significance threshold” select “Benjamini-Hochberg FDR”

*Run query, some genes matched to multiple ensemble ID, use option: “select the Ensembl IDs with the most GO annotations”, rerun query.

Upregulated genes results

Downloaded png for visualized upregulated genes result: Upregulated genes enrichment analysis image Screenshot for upregulated genes result:

Upregulated genes enrichment analysis report Table generated from downloaded csv file, top results for three data resources:

upregulated_result = read.csv("./data/gProfiler_up.csv",header=TRUE,stringsAsFactors=FALSE)

library(knitr)
kable(upregulated_result[c(1:10,704:714,729:738),1:8], type="html")
source term_name term_id adjusted_p_value negative_log10_of_adjusted_p_value term_size query_size intersection_size
1 GO:BP leukocyte chemotaxis GO:0030595 0.0000013 5.887535 224 362 22
2 GO:BP cell surface receptor signaling pathway GO:0007166 0.0000013 5.887535 3128 362 110
3 GO:BP regulation of T cell activation GO:0050863 0.0000013 5.887535 322 362 27
4 GO:BP leukocyte cell-cell adhesion GO:0007159 0.0000013 5.887535 347 362 28
5 GO:BP regulation of leukocyte cell-cell adhesion GO:1903037 0.0000013 5.887535 311 362 26
6 GO:BP cytokine-mediated signaling pathway GO:0019221 0.0000015 5.815457 797 362 44
7 GO:BP positive regulation of cell-cell adhesion GO:0022409 0.0000026 5.580974 260 362 23
8 GO:BP cell chemotaxis GO:0060326 0.0000026 5.580974 307 362 25
9 GO:BP in utero embryonic development GO:0001701 0.0000026 5.580974 379 362 28
10 GO:BP positive regulation of leukocyte cell-cell adhesion GO:1903039 0.0000028 5.552430 221 362 21
704 REAC Interleukin-10 signaling REAC:R-HSA-6783783 0.0000000 8.060253 45 236 13
705 REAC Signaling by Interleukins REAC:R-HSA-449147 0.0000092 5.034679 617 236 37
706 REAC RUNX1 and FOXP3 control the development of regulatory T lymphocytes (Tregs) REAC:R-HSA-8877330 0.0000092 5.034679 10 236 6
707 REAC Cytokine Signaling in Immune system REAC:R-HSA-1280215 0.0000339 4.470146 832 236 43
708 REAC RUNX1 regulates transcription of genes involved in differentiation of myeloid cells REAC:R-HSA-8939246 0.0006371 3.195825 6 236 4
709 REAC RUNX2 regulates genes involved in differentiation of myeloid cells REAC:R-HSA-8941333 0.0016622 2.779313 3 236 3
710 REAC Signaling by MET REAC:R-HSA-6806834 0.0066204 2.179117 77 236 9
711 REAC Interleukin-3, Interleukin-5 and GM-CSF signaling REAC:R-HSA-512988 0.0084428 2.073512 47 236 7
712 REAC RUNX1 regulates transcription of genes involved in interleukin signaling REAC:R-HSA-8939247 0.0095000 2.022278 5 236 3
713 REAC Immune System REAC:R-HSA-168256 0.0095000 2.022278 2146 236 72
714 REAC Regulation of IFNG signaling REAC:R-HSA-877312 0.0095000 2.022278 12 236 4
729 WP Cytokines and Inflammatory Response WP:WP530 0.0000631 4.199797 26 173 8
730 WP Lung fibrosis WP:WP3624 0.0000631 4.199797 63 173 11
731 WP Allograft Rejection WP:WP2328 0.0002256 3.646680 88 173 12
732 WP T-Cell antigen Receptor (TCR) pathway during Staphylococcus aureus infection WP:WP3863 0.0002646 3.577390 62 173 10
733 WP miRNAs involvement in the immune response in sepsis WP:WP4329 0.0065223 2.185597 58 173 8
734 WP VEGFA-VEGFR2 Signaling Pathway WP:WP3888 0.0069235 2.159675 236 173 17
735 WP Control of immune tolerance by vasoactive intestinal peptide WP:WP4484 0.0095010 2.022233 13 173 4
736 WP Photodynamic therapy-induced NF-kB survival signaling WP:WP3617 0.0095010 2.022233 35 173 6
737 WP Hematopoietic Stem Cell Differentiation WP:WP2849 0.0248770 1.604203 61 173 7
738 WP Spinal Cord Injury WP:WP2431 0.0248770 1.604203 117 173 10

Downregulated genes results

Downloaded png for visualized downregulated genes result: Downregulated genes enrichment analysis image Screenshot for downregulated genes result: Dwonregulated genes enrichment analysis report

Table generated from downloaded csv file, top results for three data resources:

downregulated_result = read.csv("./data/gProfiler_down.csv",header=TRUE,stringsAsFactors=FALSE)

library(knitr)
kable(downregulated_result[c(1:10,85:87),1:8], type="html")
source term_name term_id adjusted_p_value negative_log10_of_adjusted_p_value term_size query_size intersection_size
1 GO:BP response to stimulus GO:0050896 0.0000001 7.203263 9379 327 231
2 GO:BP intracellular signal transduction GO:0035556 0.0000012 5.930327 2880 327 97
3 GO:BP signal transduction GO:0007165 0.0000022 5.661500 6210 327 166
4 GO:BP immune response GO:0006955 0.0000022 5.661500 2286 327 81
5 GO:BP immune system process GO:0002376 0.0000022 5.661500 3214 327 103
6 GO:BP regulation of response to stimulus GO:0048583 0.0000075 5.124173 4370 327 126
7 GO:BP cellular response to stimulus GO:0051716 0.0000075 5.124173 7670 327 191
8 GO:BP signaling GO:0023052 0.0000080 5.095914 6681 327 172
9 GO:BP cell communication GO:0007154 0.0000182 4.739610 6706 327 171
10 GO:BP regulation of signal transduction GO:0009966 0.0004784 3.320250 3209 327 94
85 REAC Butyrophilin (BTN) family interactions REAC:R-HSA-8851680 0.0228125 1.641828 12 203 4
86 REAC Immune System REAC:R-HSA-168256 0.0228125 1.641828 2146 203 65
87 WP DNA Damage Response (only ATM dependent) WP:WP710 0.0055985 2.251929 110 139 11

8. Interpretation

1. Stimulated vs control(rest cell) samples comparison

The two groups expression difference is significant and obvious seen from P-values and heatmap.

2. Two CAR type 4-1BB/CD3z vs CD28/CD3z samples comparison

From P-value calculation, heatmap, MDS plot, the difference of expression of two CAR type samples, is not very clear. But the T-test for the two groups data is significant.

P-values calculation in section 5 of two CAR type compare result in 851 genes pass the 0.05 threshold but none pass Benjamni-Hochberg standard. But if taking into account time points, thousands of genes p-values are significant even after multiple hypothesis test.

The second heatmap in section 6, indicating CD28/CD3z samples induce more expression of some genes, consistant with the original paper.

3. The top enrichment analysis results, are the results expected?

By the description of the paper, chimeric antigen receptor modify T cells to recognize cancer cells in order to attack tumour. In the process CAR linking an antigen recognition domain to T cell signalling modules comprised of CD3z to provide signal, and CD28 or 4-1BB to provide costimulation. Looking from this perspective, the top list of pathways related to upregulated genes are all related to the CAR function:

leukocyte chemotaxis, cell surface receptor signaling pathway, regulation of T cell activation, leukocyte cell-cell adhesion, regulation of leukocyte cell-cell adhesion, T-Cell antigen Receptor (TCR) pathway during Staphylococcus aureus infection.

So the results for upregulated genes are highly expected.

For the downregulated genes, the top pathway list include:

Response to stimulus, intracellular signal transduction, signal transduction, immune response, immune system process,etc.

These are roughly related to stimulation in immune cells, but maybe not specifically expected or designed for CAR production. It’s possible they are related to side effect or toxicity of the CAR treatment, and seeking these over-representation results is one purpose of this experiment.

4. Do the over-representation results support conclusions or mechanism discussed in the original paper?

In this assignment I focused on what can be directly showed from the data, and whole groups comparison, the paper was seeking more detailed comparison of specific genes in order to find genes related to efficacy and toxicity of CAR treatment.

The results of this assignment support the conclusion of the original paper in general, as I see the over-representated genes involve processes or pathways related to the effect of CAR treatment, but the paper has more detailed study of specific genes.

5. Other publications that support some of my results

CAR ligation modified T cells described in another paper (???), the CAR ligation produced the cytokines IFN-γ, IL-2, IL-4, IL-10, TNF-α and IL-17.

In my analysis I found the similar results, the report in the above paper surport my finding. For example IL2 was among upregulated gene list, and the result is significant:

qlf_output_hits_withgn[qlf_output_hits_withgn$x=="IL2",]
##        x    logFC   logCPM        F      PValue       FDR     rank
## 5424 IL2 1.472362 3.232688 7.832573 0.009872232 0.7334858 2.005585
qlf_output_hits_withgn[qlf_output_hits_withgn$x=="IL4",]
##        x     logFC   logCPM        F     PValue       FDR     rank
## 5441 IL4 0.9729763 1.658989 3.352031 0.07936862 0.7334858 1.100351
qlf_output_hits_withgn[qlf_output_hits_withgn$x=="IL10",]
##         x    logFC   logCPM        F       PValue       FDR     rank
## 5399 IL10 3.755596 2.487423 19.18699 0.0001947212 0.2794881 3.710587
qlf_output_hits_withgn[qlf_output_hits_withgn$x=="TNF",]
##         x     logFC   logCPM        F     PValue       FDR     rank
## 12552 TNF 0.6878777 6.115726 5.854788 0.02334111 0.7334858 1.631879

Another paper (???) also had the same finding about IL2.

References

9 References

9.1 Software Applications:

GSEA: Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550) and Mootha, Lindgren, et al. (2003, Nat Genet 34, 267-273).

CytoScape: Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks Genome Research 2003 Nov; 13(11):2498-504

Enrichment map: Enrichment Map: A Network-Based Method for Gene-Set Enrichment Visualization and Interpretation, Merico D, Isserlin R, Stueker O, Emili A, Bader GD PLoS One. 2010 Nov 15;5(11):e13984

AutoAnnotate: Kucera, M., Isserlin, R., Arkhangorodsky, A., & Bader, G. D. (2016). AutoAnnotate: A Cytoscape app for summarizing networks with semantic annotations. F1000Research, 5, 1717. https://doi.org/10.12688/f1000research.9090.1

GSA R package: Title Gene Set Analysis Version 1.03.1 Author Brad Efron and R. Tibshirani Description Gene Set Analysis. Maintainer Rob Tibshirani Suggests impute License LGPL

Legend Creator, a Cytoscape app: Adam Treister (Gladstone Institutes) Alexander Pico (Gladstone Institutes)

9.2 DataBase Enrichment Analysis Pathway database: http://download.baderlab.org/EM_Genesets/

9.3 Peer reviewed Research Paper

Fracchia, Kelley M., Christine Y. Pai, and Craig M. Walsh. 2013. “Modulation of T Cell Metabolism and Function Through Calcium Signaling.” Frontiers in Immunology 4 (October): 324–24. https://doi.org/10.3389/fimmu.2013.00324.

Huse, Morgan. 2009. “The T-Cell-Receptor Signaling Network.” Journal of Cell Science 122 (9): 1269–73. https://doi.org/10.1242/jcs.042762.

Merico, Daniele, Ruth Isserlin, Oliver Stueker, Andrew Emili, and Gary D. Bader. 2010. “Enrichment Map: A Network-Based Method for Gene-Set Enrichment Visualization and Interpretation.” PloS One 5 (11): e13984–e13984. https://doi.org/10.1371/journal.pone.0013984.

Rao, Anjana, Chun Luo, and Patrick G. Hogan. 1997. “TRANSCRIPTION Factors of the Nfat Family:Regulation and Function.” Annual Review of Immunology 15 (1): 707–47. https://doi.org/10.1146/annurev.immunol.15.1.707.

Reimand, Jüri, Ruth Isserlin, Veronique Voisin, Mike Kucera, Christian Tannus-Lopes, Asha Rostamianfar, Lina Wadi, et al. 2019. “Pathway Enrichment Analysis and Visualization of Omics Data Using G:Profiler, Gsea, Cytoscape and Enrichmentmap.” Nature Protocols 14 (2): 482–517. https://doi.org/10.1038/s41596-018-0103-9.

Salter, Alexander I, Richard G Ivey, Jacob J Kennedy, Valentin Voillet, Anusha Rajan, Eva J Alderman, Uliana J Voytovich, et al. 2018. “Phosphoproteomic Analysis of Chimeric Antigen Receptor Signaling Reveals Kinetic and Quantitative Differences That Affect Cell Function.” Science Signaling 11 (544): eaat6753.

Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.